15 February 2018

A general workflow for sports analytics

Determine the need through collaboration

Articulate the need as a question

Scope out the 'minimum viable product'

Allow time for peer review

Communicate the findings in appropriate ways

Why use R for sports analytics?

(or similar programming languages)

Excel: The power and pitfalls of an 'all-in-one' tool

Data, analysis outputs, and visualisation are stored in one place

Data is easily manipulated - high risk of introducing errors

Point-and-click interface - manual repetition

Actions are not logged

Slow when working with large data sets or files storing many computations

Complex Excel workbooks can 'break' when shared with others

The case for programming

Separation of data, analysis outputs, visualisations, and reports

Reproducible, transparent, and less error-prone workflow

Much faster for repetitive actions than point-and-click

Requires you to think early and often about data structure

Procedures are documented via scripts and command history

Build up a library of repeatable, adaptable procedures

Now for some demos!

First, a quick intro to R…

R is a statistical programming language, RStudio is an IDE for using R

Its most fundamental component is 'base R' but it is highly extensible

R is free and open source: users benefit from packages created by others

The R community is global and active - lots of help for when you get stuck!

Demo 1

Summarising data from the Winter Olympics

Demo 1 uses the winter data set downloadable from this link.

It includes all podium results from Winter Olympic Games that have taken place from 1924 to 2014, inclusive.


Import data

# The readr package is efficient for reading (importing) of CSV files
library(readr)

# Read the data in
winter <- read_csv("winter.csv", col_names = TRUE, col_types = NULL)

Check the data

head(winter)
## # A tibble: 6 x 9
##    Year City     Sport    Discipline Athlete   Country Gender Event  Medal
##   <int> <chr>    <chr>    <chr>      <chr>     <chr>   <chr>  <chr>  <chr>
## 1  1924 Chamonix Biathlon Biathlon   BERTHET,~ FRA     Men    Milit~ Bron~
## 2  1924 Chamonix Biathlon Biathlon   MANDRILL~ FRA     Men    Milit~ Bron~
## 3  1924 Chamonix Biathlon Biathlon   MANDRILL~ FRA     Men    Milit~ Bron~
## 4  1924 Chamonix Biathlon Biathlon   VANDELLE~ FRA     Men    Milit~ Bron~
## 5  1924 Chamonix Biathlon Biathlon   AUFDENBL~ SUI     Men    Milit~ Gold 
## 6  1924 Chamonix Biathlon Biathlon   JULEN, A~ SUI     Men    Milit~ Gold

Subset the data

Let's say we are interested in comparing the number of gold medals won by Canada, Norway, and Sweden from the last five Winter Olympic Games.

We need to subset the data to focus only on:

  • Gold medal results
  • Athletes from Canada (CAN), Norway (NOR), or Sweden (SWE), and
  • Results from the last five Winter Olympics (1998 to 2014)

Subset the data

# The dplyr package includes useful data manipulation functions
library(dplyr)

# Subset the data
goldMedalComparison <- winter %>%
  filter(Medal == "Gold"
         & Country %in% c("CAN", "NOR", "SWE")
         & Year >= 1998)

Check the data

goldMedalComparison
## # A tibble: 358 x 9
##     Year City   Sport     Discipline   Athlete  Country Gender Event Medal
##    <int> <chr>  <chr>     <chr>        <chr>    <chr>   <chr>  <chr> <chr>
##  1  1998 Nagano Biathlon  Biathlon     BJOERND~ NOR     Men    10KM  Gold 
##  2  1998 Nagano Biathlon  Biathlon     HANEVOL~ NOR     Men    20KM  Gold 
##  3  1998 Nagano Bobsleigh Bobsleigh    LUEDERS~ CAN     Men    Two-~ Gold 
##  4  1998 Nagano Bobsleigh Bobsleigh    MACEACH~ CAN     Men    Two-~ Gold 
##  5  1998 Nagano Curling   Curling      BETKER,~ CAN     Women  Curl~ Gold 
##  6  1998 Nagano Curling   Curling      FORD, A~ CAN     Women  Curl~ Gold 
##  7  1998 Nagano Curling   Curling      GUDEREI~ CAN     Women  Curl~ Gold 
##  8  1998 Nagano Curling   Curling      MCCUSKE~ CAN     Women  Curl~ Gold 
##  9  1998 Nagano Curling   Curling      SCHMIRL~ CAN     Women  Curl~ Gold 
## 10  1998 Nagano Skating   Short Track~ BEDARD,~ CAN     Men    5000~ Gold 
## # ... with 348 more rows

Wrangle the data

In the winter data set, the data is structured such that one row is one medal-winning athlete.

However…

  • Team events (e.g., bobsleigh) comprise multiple athletes, and
  • Within teams that achieve a podium finish, each athlete is awarded a medal.

For this analysis, we need to wrangle the data to get it into a format where one row represents one gold medal per event (rather than per athlete).

Wrangle the data

# Create a new variable to identify each unique event
goldMedalComparison$uniqueEventID <-
  paste(goldMedalComparison$Year, goldMedalComparison$Sport,
        goldMedalComparison$Discipline, goldMedalComparison$Country,
        goldMedalComparison$Gender, goldMedalComparison$Event, sep="_")

Wrangle the data

goldMedalComparison %>%
  select(uniqueEventID)
## # A tibble: 358 x 1
##    uniqueEventID                                             
##    <chr>                                                     
##  1 1998_Biathlon_Biathlon_NOR_Men_10KM                       
##  2 1998_Biathlon_Biathlon_NOR_Men_20KM                       
##  3 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man                  
##  4 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man                  
##  5 1998_Curling_Curling_CAN_Women_Curling                    
##  6 1998_Curling_Curling_CAN_Women_Curling                    
##  7 1998_Curling_Curling_CAN_Women_Curling                    
##  8 1998_Curling_Curling_CAN_Women_Curling                    
##  9 1998_Curling_Curling_CAN_Women_Curling                    
## 10 1998_Skating_Short Track Speed Skating_CAN_Men_5000M Relay
## # ... with 348 more rows

Wrangle the data

# Identify duplicates
goldMedalComparison$duplicates <- duplicated(goldMedalComparison$uniqueEventID)

goldMedalComparison %>%
  select(uniqueEventID, duplicates)
## # A tibble: 358 x 2
##    uniqueEventID                                              duplicates
##    <chr>                                                      <lgl>     
##  1 1998_Biathlon_Biathlon_NOR_Men_10KM                        F         
##  2 1998_Biathlon_Biathlon_NOR_Men_20KM                        F         
##  3 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man                   F         
##  4 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man                   T         
##  5 1998_Curling_Curling_CAN_Women_Curling                     F         
##  6 1998_Curling_Curling_CAN_Women_Curling                     T         
##  7 1998_Curling_Curling_CAN_Women_Curling                     T         
##  8 1998_Curling_Curling_CAN_Women_Curling                     T         
##  9 1998_Curling_Curling_CAN_Women_Curling                     T         
## 10 1998_Skating_Short Track Speed Skating_CAN_Men_5000M Relay F         
## # ... with 348 more rows

Wrangle the data

# Remove duplicates from the data
# and remove temporary variables used for identifying duplicates
goldMedalComparison <- goldMedalComparison %>%
  filter(duplicates == FALSE) %>%
  select(-uniqueEventID, -duplicates)

Wrangle the data

goldMedalComparison %>%
  select(Year, Sport, Country, Event)
## # A tibble: 105 x 4
##     Year Sport     Country Event       
##    <int> <chr>     <chr>   <chr>       
##  1  1998 Biathlon  NOR     10KM        
##  2  1998 Biathlon  NOR     20KM        
##  3  1998 Bobsleigh CAN     Two-Man     
##  4  1998 Curling   CAN     Curling     
##  5  1998 Skating   CAN     5000M Relay 
##  6  1998 Skating   CAN     500M        
##  7  1998 Skating   NOR     1500M       
##  8  1998 Skating   CAN     500M        
##  9  1998 Skiing    NOR     Slalom      
## 10  1998 Skiing    NOR     10KM Pursuit
## # ... with 95 more rows

Calculate summary statistics

The data is now in a format that is ready for calculating summary statistics.

For instance, we can tabulate the total number of gold medals won by Canada, Norway, and Sweden at each of the Winter Games since 1998.

Calculate summary statistics

# The reshape2 package is useful for manipulating data
# from long-to-wide format and vice versa
library(reshape2)

# Calculate total number of gold medals per team per Winter Games
goldMedalTotals <- goldMedalComparison %>%
  group_by(Year, City, Country) %>%
  summarise(goldMedalTotal = length(Medal)) %>%
  ungroup()

# Reshape the data from long-to-wide format for tabulating
goldMedalTotalsWide <- dcast(goldMedalTotals,
                             Year + City ~ Country,
                             value.var = "goldMedalTotal")

Calculate summary statistics

# Check the reshaped data
goldMedalTotalsWide
##   Year           City CAN NOR SWE
## 1 1998         Nagano   6  10  NA
## 2 2002 Salt Lake City   8  12  NA
## 3 2006          Turin   7   2   7
## 4 2010      Vancouver  15   9   5
## 5 2014          Sochi  10  12   2
# NAs appear in the data because Sweden did not win any gold medals
# at the 1998 and 2002 Winter Games

# Replace these NAs with zeroes
goldMedalTotalsWide$SWE[is.na(goldMedalTotalsWide$SWE)] <- 0

Calculate summary statistics

# The DT package is used to display interactive tables in HTML documents
library(DT)

# Display the table
datatable(goldMedalTotalsWide, options = list(dom = 't'), rownames = FALSE)

Calculate summary statistics

We can use the data frame we've just created - gold medal totals per team per Games - to calculate the median number of gold medals won by Canada, Norway, and Sweden at the last five Games.

goldMedalMedians <- goldMedalTotalsWide %>%
  summarise(CAN = median(CAN),
            NOR = median(NOR),
            SWE = median(SWE))

goldMedalMedians
##   CAN NOR SWE
## 1   8  10   2

Demo 2

Visualising AFLW data

For demo 2, we will use data from season 1 of the AFLW to show how sports data can be visualised using R.

Import the data

# This data was obtained from this Github repository:
# https://github.com/ropenscilabs/ozwomensport/tree/master/AFLW
aflwTeamStats <- read_csv("teams.csv", col_names = TRUE, col_types = NULL)

head(aflwTeamStats)
## # A tibble: 6 x 49
##   Club       Kicks_TOT Kicks_AVG Handballs_TOT Handballs_AVG Disposals_TOT
##   <chr>          <int>     <dbl>         <int>         <dbl>         <int>
## 1 Adelaide ~      1052       132           458          57.3          1510
## 2 All Clubs       6840       118          3478          60.2         10318
## 3 Brisbane ~       977       122           416          52.0          1393
## 4 Carlton          780       111           424          60.6          1204
## 5 Collingwo~       838       120           335          47.9          1173
## 6 Fremantle        818       117           409          58.4          1227
## # ... with 43 more variables: Disposals_AVG <dbl>, `Cont Poss_TOT` <int>,
## #   `Cont Poss_AVG` <dbl>, `Uncont Poss_TOT` <int>, `Uncont
## #   Poss_AVG` <dbl>, `Disp eff %_TOT` <dbl>, Clangers_TOT <int>,
## #   Clangers_AVG <dbl>, Marks_TOT <int>, Marks_AVG <dbl>, `Cont
## #   marks_TOT` <int>, `Cont marks_AVG` <dbl>, Marks50_TOT <int>,
## #   Marks50_AVG <dbl>, `Hit-outs_TOT` <int>, `Hit-outs_AVG` <dbl>,
## #   Clearances_TOT <int>, Clearances_AVG <dbl>, `Centre clr_TOT` <int>,
## #   `Centre clr_AVG` <dbl>, Stoppages_TOT <int>, Stoppages_AVG <dbl>,
## #   Rebound50_TOT <int>, Rebound50_AVG <dbl>, `Frees For_TOT` <int>,
## #   `Frees For_AVG` <dbl>, `Frees Agst_TOT` <int>, `Frees Agst_AVG` <dbl>,
## #   `One %s_TOT` <int>, `One %s_AVG` <dbl>, Bounces_TOT <int>,
## #   Bounces_AVG <dbl>, Inside50_TOT <int>, Inside50_AVG <dbl>,
## #   Behinds_TOT <int>, Behinds_AVG <dbl>, `Goal assists_TOT` <int>, `Goal
## #   assists_AVG` <dbl>, `Goal acc %_TOT` <dbl>, `AFL Fantasy_TOT` <int>,
## #   `AFL Fantasy_AVG` <dbl>, Interchanges_TOT <int>,
## #   Interchanges_AVG <dbl>

Subset the data

Let's compare some basic stats across clubs. First, we need to prepare the data.

aflwTeamStats <- aflwTeamStats %>%
  filter(Club != "All Clubs")

Plot the data

# The ggplot2 package is widely used for plotting in R
# It uses a layer-by-layer approach to building plots
library(ggplot2)

# Sort by average kicks per game
aflwTeamStats <- aflwTeamStats[order(aflwTeamStats$Kicks_AVG), ]

# Build the plot
p <- ggplot(aflwTeamStats, aes(x = reorder(Club, -Kicks_AVG), y = Kicks_AVG, fill = Club))
p <- p + geom_bar(stat = "identity")
p <- p + scale_y_continuous(limits = c(0,150),
                            breaks = c(0,25,50,75,100,125,150))
p <- p + labs(title = "Comparing average kicks per game across AFLW teams",
              x = "Club",
              y = "Average kicks per game")
p <- p + theme(legend.position = "none",
               axis.text.x  = element_text(angle=90, vjust=0.5))
aflwKicksPlot <- p

Plot the data

Other neat things you can do in R

Statistical analyses

This presentation is not a deep dive into statistics, but R is more than capable of running a vast array of stats including:

  • Hypothesis testing techniques
  • Effect sizes
  • Mixed models
  • Regression and classification
  • Network analysis
  • Time series analysis
  • And much much more…

Text mining

Geospatial plots

Geospatial plots

Interactive dashboards

Appendix 1: Learning R

Starter resources

Try R: An in-the-browser tutorial for getting to know the R language - no software installation needed.

'Introduction to R': Free course on Data Camp.

'R for Data Science', by Garrett Grolemund & Hadley Wickham.

'Data wrangling, exploration, and analysis with R': Course materials from a unit taught at the University of British Columbia by Prof Jenny Bryan.

'R for data analysis and visualisation': Data Carpentry lesson using ecology data

The Data Science Guide: A useful resource for finding content all over the web for learning different data science skills and programming languages. You can tailor your search to your current level of expertise in a given domain.

Transitioning from Excel to R

Appendix 2: The R community

The R community

Follow the #rstats hashtag on Twitter

Many prominent R users and developers have blogs and / or tweet regularly. Here are some of my favourites:

@JennyBryan | @ma_salmon | @alice_data | @dataandme | @visnut | @kierisi | @thomasp85 | @xieyihui | @drob | @juliasilge

Join, read, and contribute to the R Studio Community discussions: https://community.rstudio.com/

You can discover lots of weird, wonderful, and useful R packages and repositories on Github: https://github.com/trending/r

Appendix 3: Sports data sets

'Ready to analyse'

Needs scraping / cleaning

Other references and resources

References and resources